Transform Unstructured Text into Structured Data

The MITTENSS Research Workshop

Wei-Chu Chen

2026-01-30

Agenda

We only have 2.5 hours… 😱

  • Not included

    • What is LLM/AI
    • The fundamentals and syntax of R
    • Text data cleaning & wrangling

Agenda

  • Included

    • What is API, and why use it
    • The overall workflow of processing text data with LLMs via API
    • The required or helpful techniques: Markdown & JSON
    • Extra: How to improve LLM’s performance

Today’s Objectives

Let’s be realistic

  • For those who are familiar with R

    Get ideas about how to use LLM to process data via API, and modify the workflow to your own data/projects

  • For those who are new to R

    Know where to start and the learning path

Materials

Github repo https://github.com/ibertchen/epic_llm_workshop

Posit Cloud https://posit.cloud if you don’t have R installed on your computer

Case demo

Workflow

A simplified workflow:

  1. Prepare data in tidy format(-ish)
    • web scraping, importing text from web pages or PDFs ❌
  2. Write system and behavior prompts
    • Markdown for general commands
    • JSON schema to define output data structure
  3. Send data to the LLM server via API
  4. Wait for data processing… ⌚
  5. Retrieve output

What is API?

  • An Application Programming Interface (API) is a set of rules that allows different software applications (e.g., R and OpenAI LLMs) to communicate and share data or functionality seamlessly.

  • Most mainstream LLM providers (OpenAI, Google, Anthropic, Hugging Face) have API services.

Why not just using ChatGPT? 🤔

OpenAI ChatGPT vs API

Feature ChatGPT (The Product) OpenAI API (The Engine)
Interface A website/app you can chat with. A “pipe” connecting your code to the AI.
Setup None (Sign in and start typing). Requires coding (Python, R, etc.).
Memory Remembers you across conversations. You must “teach” it memory yourself.
Pricing Flat monthly fee (Free, $20, or $200). Pay-per-use (billed by the word/token).
Control Limited (OpenAI sets the rules). High (You control creativity and logic).

Why use API?

Regarding data processing

  1. Large-Scale Automation

    • Batch Processing: You can process thousands of documents, survey responses, or datasets at once. While the Chat app requires you to copy-paste or upload files manually, the API can run 24/7 in the background without human intervention.

    • Cost Efficiency: As of 2026, the major service providers offer a “Batch Mode” for the API that gives you a 50% discount if you don’t need the results instantly (e.g., overnight processing).

Why use API?

  1. Precise Control

    • System Instructions: You can hard-code a specific “persona” or set of rules that the AI must follow for every single query.

    • Temperature & Top-P: You can adjust the “creativity” level. In research, you might set the Temperature to 0 to ensure the AI gives the same, most likely answer every time, minimizing “hallucinations.”

Why use API?

  1. Data Privacy & Security

    • Training Exclusion: Typically, data sent through the API is not used by the service providers to train their future models. This is crucial for researchers handling sensitive interview transcripts or unpublished datasets.

    • Institutional Compliance: The API allows for enterprise-grade security that often meets university ethics and IRB requirements.

Why use API?

  1. Structured Output

    • Data format: You can explicitly control the API to produce data in a specific format (e.g., string, numbers, date and time, etc).

    • File format: You can force the API to give you JSON or CSV data. This means the AI’s response is already formatted as a spreadsheet or database entry, ready for analysis in STATA, SPSS, or Excel.

MSU AI Policies and Tools

Important

Follow the MSU AI policies, espeically for confidential data.

MSU AI Policies

MSU AI Tools

(I am watching you!)

Types of API

Broady speaking, API services can be categorized into real-time APIs and batch APIs.

Feature Realtime API Batch API
Primary Goal Ultra-low latency / Interaction Cost efficiency / Volume
Response Time Immediate (usually <500ms) Up to 24 hours
Cost Standard 50% Discount
Connection WebSockets / WebRTC (Streaming) File Upload (JSONL)
Best Modality Speech-to-Speech, Live Text Bulk Text, Embeddings, Eval
Rate Limits Standard Significantly higher limits

Use Batch API When…

  • Immediate response is not required
  • Processing large-scale datasets
  • Cost efficiency is important
  • Require higher rate limits
  • Fast completion times (for large datasets).
  • Safe to disconnect the server after upload the files and resume the task later.

Access API Service

  • The API service is configured on OpenAI Platform
  • Contact the MSU IT for purchasing and accessing the MSU Enterprise-Licensed API service

Markdown Language

Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular markup languages. (source)

https://www.markdownguide.org/

Markdown Language

Markdown is incredibly helpful when working with LLM API services. It acts as a universal bridge between how humans read data and how machines process text.

  • Structural Clarity: Markdown helps the model understand the hierarchy and importance of your prompt. This leads to more organized and relevant responses.

  • Separation of Concerns: Markdown allows you to clearly separate different parts of a prompt. This prevents the model from getting confused between your commands and the data it needs to process.

  • Precise Code Handling: Markdown provides a standard way to denote code blocks, allowing the model to explicitly read your code as instructions or examples and generate outcomes in appropriate formats.

  • ❗Even most of the content generated by AIs are using Markdown.

JSON

https://www.json.org/json-en.html

JSON (JavaScript Object Notation) has become the standard for structured data exchange on the web. The data format provides a unique balance between machine efficiency and human readability.

  • Why care? Most LLMs currently generate structured data in JSON format.

JSON

The major advantages of JSON:

  • Lightweight and Efficient: JSON uses simple braces and colons ("name": "value"), resulting in smaller file sizes and faster data transmission.
  • Language Independence: Almost every modern programming language can read and output JSON files. This makes it a universal “bridge” between different systems.
  • Human-Readable Structure: JSON is organized into two easy-to-understand structures: Objects {} and Arrays []
  • Ideal for APIs: Its ability to represent nested, hierarchical data structures makes it perfect for complex data models. It can be used to control the data property generated by APIs.

JSON Example

  • Attributes can be flexible and vary across different objects.
  • Values can be in different formats (e.g., string, integer, boolean, or list)
{
      "name": "Pikachu",
      "category": "Mouse Pokémon",
      "is_legendary": false,
      "skill": ["Static", "Lightning Rod"],
      "primary_type": "Electric",
      "z_move_eligible": true
    },
    {
      "name": "Eevee",
      "category": "Evolution Pokémon",
      "is_legendary": false,
      "skill": ["Run Away", "Adaptability", "Anticipation"],
      "primary_type": "Normal",
      "evolution_count": 8
}

QUIZ!

How many times have you gotten the same outcome from asking an AI the same question?

Structured Output

The functionality of generating structured output reduces randomness and variation within LLMs.

OpenAI’s Structured model output

Google Gemini’s Structured outputs

Structured Output

LLMs with structured output functionality can

  • Forces the LLM to output in a predefined, machine‑readable format
  • Makes the response predictable and easy to parse programmatically
  • Improves accuracy and consistency, because the model must fill specific fields of a fixed structure every time
  • Reduces hallucinations and irrelevant content by constraining the model to valid values and fields defined in the schema

JSON Schema

We can ask LLMs to generate content that adheres to the provided JSON schema.

JSON Schema is a declarative language for defining structure and constraints for JSON data

Structured Outputs is a feature that ensures the model will always generate responses that adhere to your supplied JSON Schema (source)

JSON Schema Example

{
  "title": "PokemonSchema",
  "required": [
    "name",
    "category",
    "is_legendary",
    "skill",
    "primary_type"
  ],
  "properties": {
    "name": {
      "type": "string"
    },
    "category": {
      "type": "string"
    },
    "is_legendary": {
      "type": "boolean"
    },
    "skill": {
      "type": "array",
      "items": {
        "type": "string"
      },
    },
    "primary_type": {
      "type": "string"
    },
    "z_move_eligible": {
      "type": "boolean"
    },
    "evolution_count": {
      "type": "integer"
    }
  }
}

Hands-On Time

FINALLY!

R Packages

ellmer

Improve LLM’s Performance

  1. Always go for system prompt customization first (Aka prompt engineering)

  2. Retrieval-Augmented Generation (RAG)

  3. The last is Fine-Turning

People also say:

  • Start with the best models (OpenAI, Google, Claude)

  • Do not start with non-public information or data

    • Get a good grasp of how APIs work first, and only then think about security and data privacy.